Dyr og Data

Statistical thinking — generalized linear models

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2025-10-31

Introduction

Introduction

A Gaussian linear model describes the expected value (mean) of the response as a linear combination of the predictor variables

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij} + \varepsilon_i\]

Implies that a constant change in a predictor leads to a constant change in the response

This is intuitive for a response that has a normal distribution: it can vary effectively indefinitely in either direction, or over a small range (e.g. heights)

This is inappropriate for a wide range of types of response variables common in animal science data sets

Introduction

Animal science data are often discrete or categorical

Binary (dichotomous) variables

  • Suffering from a particular disease
  • Obese or normal weight

Count variables

  • number of disease recordings per herd
  • number of discrete play behaviours observed

Also continuous, positive (or non-negative) data such as

  • concentrations of particular compounds in blood
  • weight of an animal
  • milk yield

What are we doing?

When we model we make assumptions regardless of the model we fit — the Gaussian linear model makes several assumptions

One assumption is that each value of the response \(y_i\) is a draw from a Gaussian random variable with mean \(\mu_i\) and standard deviation \(\sigma\)

We write this as

\[ y_i \sim \mathcal{N}(\mu_i, \; \sigma) \]

Read \(\sim\) as is distributed as, and \(\mathcal{N}\) is the normal or Gaussian distribution
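To make this assumption concrete, we can simulate data from such a model and refit it; a minimal sketch with made-up values (\(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\)):

```r
# simulate data from the Gaussian linear model y_i ~ N(mu_i, sigma)
# beta_0 = 2, beta_1 = 0.5, sigma = 1 are made-up values for illustration
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
mu <- 2 + 0.5 * x                 # the mean mu_i depends on x_i
y <- rnorm(n, mean = mu, sd = 1)  # each y_i is a draw from N(mu_i, sigma)

# fitting the linear model recovers estimates close to the true values
m_sim <- lm(y ~ x)
coef(m_sim)
```

Refitting the simulated data is a useful habit: if the estimates don't land near the values you simulated with, something is wrong with the model or the simulation.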

What are we doing?

The mean of the Gaussian random variable is the expected value of \(y_i\) given a value of \(x_i\)

By expected value we mean the typical or most likely value

If we are being fancy, we write this expected value as \(\mathbb{E}(y_i)\)

And we state: \(\mathbb{E}(y_i) = \mu_i\)

The expected value of \(y_i\) is equal to \(\mu_i\), the mean of the Gaussian random variable

What are we doing?

We model this expected value through the linear predictor

But instead of thinking of modelling \(y_i\) directly, as we have before

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij} + \textcolor{red}{\varepsilon_i}\]

we can think of modelling the expected value of \(y_i\)

\[\mathbb{E}(y_i) = \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij}\]

Now we don’t have those model errors (\(\textcolor{red}{\varepsilon_i}\)) anywhere in the linear predictor

But where did the errors \(\textcolor{red}{\varepsilon_i}\) go?

To answer that, let’s draw some pictures of what we are doing

What are we doing?

We have some data \(x_i\) that we use to model \(y_i\)

Figure 1

What are we doing?

We fit the model \(\mathbb{E}(y_i) = \mu_i = \beta_0 + \beta_1 x_i\) by finding the line that minimises the sum of squared errors

Figure 2

What are we doing?

But we assume that each data point \(y_i\) comes from a Gaussian distribution with mean \(\mu_i\) and standard deviation \(\sigma\)

Figure 3

What are we doing?

Orange line = fitted relationship

Red dots = fitted value \(\hat{y}_i\)

Green dots = selected observed \(y_i\)

Black lines = fitted Gaussian distribution for each green dot

Note the width of each Gaussian is the same: we assume that the variance of the data \(y_i\) is the same for all observations

Note that the Gaussians extend into negative values of \(\mathbf{y}\)

Logistic Regression

Logistic Regression

A binary or dichotomous variable can take only one of two values

Numerically these are coded as 0 and 1 when we fit a logistic regression

  • a value of 1 is generally known as the success state; more usefully this indicates cases where the risk factor or outcome of interest is present,
  • a value of 0 is the failure state; more usefully the absence of the risk factor or outcome of interest

Alternatively, we may have data representing a proportion of successes from some total in a set of \(m\) trials or experiments

Such variables are strictly bounded at 0 and 1

Logistic Regression

Association between high somatic cell count (SCC) & milk yield in dairy cows

Define high SCC as SCC > 125,000 cells ml\(^{-1}\)

Linear regression line could predict negative probability of high SCC

Variance unlikely to be constant; reduced at 0 & 1

scc <- read_csv("data/itve-milk-scc/milk-scc.csv") |>
  rename(
    milk_yield_kg = kgmilk,
    cell_count = cellcount,
    cow_id = cowid
  ) |>
  mutate(
    high_scc = cell_count > 125, # cell_count is in 1,000s of cells per ml
    breed = factor(
      breed,
      levels = c(1, 2, 3, 8),
      labels = c("Red Danish", "Holstein", "Jersey", "Cross-breed")),
    parity = factor(parity))

Logistic regression

p1 <- ggplot(scc, aes(x = milk_yield_kg, y = as.numeric(high_scc))) +
    geom_jitter(position = position_jitter(height = 0.05)) +
    ylab("High somatic cell count?") + xlab("Milk yield (kg)")
p1 + geom_smooth(method = "lm", se = FALSE)

Logistic Regression

Would like to fit an s-shaped curve to the data but we want a linear equation

Can achieve both aims via a transformation of the s-shaped response to a straight line: logit transformation

Logistic Regression

The logistic regression is given by the following equation

\[\log \left ( \frac{p}{1 - p} \right ) = \beta_0 + \beta_{1} x_1\]

\(\displaystyle \frac{p}{1 - p}\) is the odds, a measure that compares the probability of an event happening with the probability that it does not happen

If \(p\) = 0.25 the odds are \(\frac{0.25}{1 - 0.25} = \frac{1}{3}\) or 1 success for every 3 failures

If \(p\) = 0.8 the odds are \(\frac{0.8}{1 - 0.8} = 4\) or 4 successes for each failure

The logit is the natural log of the odds; odds are a non-negative number, but the log odds can be any real number

These are all transformations that we can undo / reverse!
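A quick check of the odds examples above in R, where `qlogis()` computes the logit and `plogis()` undoes it:

```r
# odds for p = 0.25 and p = 0.8
p <- c(0.25, 0.8)
odds <- p / (1 - p)          # 1/3 and 4
logit <- log(odds)           # the log odds: any real number

# qlogis() computes the logit directly; plogis() reverses it
all.equal(logit, qlogis(p))  # TRUE
all.equal(plogis(logit), p)  # TRUE
```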

Logistic Regression

\[\log \left ( \frac{p}{1 - p} \right ) = \beta_0 + \beta_{1} x_1\]

\(\beta_0\) is a constant term; predicted log-odds of success (high SCC) where \(x\) is 0 (0 milk yield)

\(\beta_1\) is the slope; predicted change in the log-odds of success (high SCC) for a 1 unit increase in \(x\) (an additional kg of milk per day)

Because log-odds can take any value

  • \(\beta_1 \mathsf{> 0}\); increasing \(x\) associated with an increase in probability of success
  • \(\beta_1 \mathsf{< 0}\); increasing \(x\) associated with a decrease in probability of success
  • \(\beta_1 \mathsf{= 0}\); increasing \(x\) has no effect on probability of success

Logistic Regression

A method known as maximum likelihood (ML) is used to estimate values of \(\beta_0\) and \(\beta_j\) from the data, the maximum likelihood estimates (MLEs)

So called because these estimates are the ones that make the data most likely under the model

In linear regression we minimised the residual sums of squares

With GLMs we minimise the deviance — think of this as the lack of fit of the model

Logistic Regression

m <- glm(high_scc ~ milk_yield_kg, data = scc, family = binomial)

summary(m)

Call:
glm(formula = high_scc ~ milk_yield_kg, family = binomial, data = scc)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.819109   0.195370   4.193 2.76e-05 ***
milk_yield_kg -0.035740   0.009306  -3.841 0.000123 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1577.2  on 1139  degrees of freedom
Residual deviance: 1562.2  on 1138  degrees of freedom
AIC: 1566.2

Number of Fisher Scoring iterations: 4

The output looks the same, so do not sweat the detail if it is unclear 😰

Logistic Regression

Can compare likelihoods of two models using a likelihood ratio test

m <- glm(high_scc ~ milk_yield_kg, data = scc, family = binomial)
m0 <- glm(high_scc ~ 1, data = scc, family = binomial)
anova(m0, m, test = "LRT")
Analysis of Deviance Table

Model 1: high_scc ~ 1
Model 2: high_scc ~ milk_yield_kg
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1      1139     1577.2                          
2      1138     1562.2  1   15.056 0.0001044 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that compared to the null model of no change in risk of high SCC, the milk yield of a cow has a statistically interesting effect on the probability of high SCC

We would reject H0: the data we observed are very unlikely to have arisen if H0 were true (about 1 in 10,000 chance)

Logistic Regression

printCoefmat(coef(summary(m)), digits = 4)
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.819109   0.195370   4.193 2.76e-05 ***
milk_yield_kg -0.035740   0.009306  -3.841 0.000123 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(m)), 3) # exp() is the inverse or opposite of log()
  (Intercept) milk_yield_kg 
        2.268         0.965 

Estimate contains the estimates for \(\beta_j\) in log odds; easier to interpret on odds scale

Logistic Regression

printCoefmat(coef(summary(m)), digits = 4)
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.819109   0.195370   4.193 2.76e-05 ***
milk_yield_kg -0.035740   0.009306  -3.841 0.000123 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Tests of \(H_0: \; \beta_j = 0\) are done using a Wald statistic instead of the \(t\) test

\[z = \frac{\beta_j - 0}{\mathsf{SE}(\beta_j)}\]

Rather than following a \(t\) distribution, Wald statistics are approximately normally distributed; probability of seeing a \(z\) as extreme as observed under a standard normal distribution
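We can reproduce this Wald test for milk_yield_kg by hand from the Estimate and Std. Error columns in the table above:

```r
# Wald z statistic for milk_yield_kg, from the coefficient table
est <- -0.035740
se  <-  0.009306
z <- (est - 0) / se          # approx -3.841, matching the z value column

# two-sided p-value from the standard normal distribution
p_val <- 2 * pnorm(-abs(z))  # approx 0.000123, matching Pr(>|z|)
```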

Logistic Regression

printCoefmat(coef(summary(m)), digits = 4)
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.819109   0.195370   4.193 2.76e-05 ***
milk_yield_kg -0.035740   0.009306  -3.841 0.000123 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(m)), 3) # exp() is the inverse or opposite of log()
  (Intercept) milk_yield_kg 
        2.268         0.965 

Odds of high SCC for a non-yielding cow are 2.268

\(\beta_1\) is negative, so higher yielding cows are associated with decreased risk of high SCC; for each additional kg of yield, the odds of high SCC are multiplied by 0.965

Why times? additive effects on the log scale are multiplicative on the non-log scale
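As a quick check that additive changes on the log-odds scale multiply the odds, using the estimated slope from the summary:

```r
# estimated slope for milk_yield_kg on the log-odds scale
b1 <- -0.035740

# a 5 kg increase adds 5 * b1 to the log odds, which multiplies
# the odds by exp(b1) five times over
all.equal(exp(5 * b1), exp(b1)^5)  # TRUE

exp(b1)      # odds multiplied by ~0.965 per extra kg
exp(5 * b1)  # odds multiplied by ~0.836 for 5 extra kg
```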

Odds ratio

Odds ratio of high SCC comparing milk_yield_kg of 25 kg with 20 kg

exp(-0.036 * (25 - 20))
[1] 0.8352702

(Note: \(\beta_1\) = -0.036)

Don’t do it by hand, use comparisons()

m |> comparisons(
    variables = list(milk_yield_kg = c(20, 25)),
    comparison = "lnoravg", transform = "exp"
  )

 Estimate Pr(>|z|)    S 2.5 % 97.5 %
    0.836   <0.001 13.0 0.763  0.916

Term: milk_yield_kg
Type: response
Comparison: ln(odds(25) / odds(20))

transform = "exp" because we want the inverse of log(); "lnoravg" requests the log-odds ratio as the comparison

Risk ratio

If those sound complicated, welcome to the club because they are!

It is much more useful to describe the model in terms of the predicted probabilities or as risk ratios

A risk ratio or relative risk is the ratio of the risk under two conditions

  • with or without a risk factor.
  • increased milk yield from 20 to 25 kg

\[ \text{risk} = \frac{\text{number of cases}}{\text{number at risk}} \]

\[ \text{risk ratio} = \frac{\text{risk}_a}{\text{risk}_b} \]
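A sketch of this calculation by hand, plugging the fitted coefficients from the earlier summary into the inverse logit (`plogis()` in R):

```r
# coefficients from the fitted logistic regression for high SCC
b0 <- 0.819109
b1 <- -0.035740

# risk (probability of high SCC) at 20 kg and 25 kg milk yield
risk_20 <- plogis(b0 + b1 * 20)  # approx 0.526
risk_25 <- plogis(b0 + b1 * 25)  # approx 0.481

# risk ratio: risk at 25 kg relative to risk at 20 kg
risk_25 / risk_20                # approx 0.915
```

This matches the `comparisons(..., comparison = "ratioavg")` output shown on the next slide; in practice use `comparisons()`, which also gives standard errors and confidence intervals.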

Risk ratio

Compute the risks: Estimate is the probability of high SCC at the two milk yields (20 & 25 kg)

m |> predictions(newdata = datagrid(milk_yield_kg = c(20, 25)),
  type = "response")

 milk_yield_kg Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
            20    0.526     0.0149 35.3   <0.001 906.2 0.497  0.555
            25    0.481     0.0189 25.5   <0.001 472.4 0.444  0.518

Type: response

We want the ratio of these two risks (0.526 & 0.481):

m |> comparisons(variables = list(milk_yield_kg = c(20, 25)),
  comparison = "ratioavg")

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
    0.915     0.0222 41.2   <0.001 Inf 0.872  0.959

Term: milk_yield_kg
Type: response
Comparison: mean(25) / mean(20)

Logistic Regression

If we want the predicted probability of high SCC for our two values of milk yield, these are the model predictions

Note: the response in a logistic regression model is the probability of “success”, where in this case “success” is “high SCC”

We just saw these probabilities, but it is worth repeating them

m |> predictions(newdata = datagrid(milk_yield_kg = c(20, 25)))

 milk_yield_kg Estimate Pr(>|z|)   S 2.5 % 97.5 %
            20    0.526   0.0806 3.6 0.497  0.555
            25    0.481   0.3262 1.6 0.445  0.519

Type: invlink(link)

We state which values of milk_yield_kg we want predictions for using datagrid()

Logistic Regression

We can do this for more than just a couple of values of milk_yield_kg

plot_predictions() will draw graphs of the model predictions conditioned on milk_yield_kg

ll <- labs(y = "Prob. of high SCC", x = "Milk yield (kg)")
m |> plot_predictions(by = "milk_yield_kg", type = "response") + ll

Logistic Regression

Here, we extrapolate well beyond the range of the data

Approximation used to create the confidence band starts to fail

m |> plot_predictions(
    by = "milk_yield_kg",
    type = "response",
    newdata = datagrid(milk_yield_kg = seq(3, 100, by = 1))
  ) + ll

Generalised Linear Models

Generalised Linear Models

Generalised Linear Models (GLMs) are a powerful extension to the linear regression model, extending the types of data & conditional distributions that can be modelled beyond the normal or Gaussian distribution of linear regression

  • Binary (dichotomous) variables can be modelled using logistic regression

  • Count data can be modelled using Poisson regression

  • Milk yield, animal weight, or other positive, continuous variables can be modelled using a gamma regression

All are special cases of the broad class of GLMs

Also includes linear regression as a special case

Generalised Linear Models

Logistic regression is a special case of the GLM where the conditional distribution of the response is assumed to be binomial

GLMs allow the conditional distribution of the response to be any distribution from the exponential family; Poisson, binomial, Gaussian, gamma, multinomial, …

There are three parts to a GLM

  • The conditional distribution of \(y\)
  • The linear predictor \(\eta\), and
  • The link function

Whilst this affords a wealth of options, natural choices for the conditional distribution of \(y\) and the link function often arise from the type of data being modelled

GLM: Conditional distribution

In a GLM we want a model for the expectation of \(y\), \(\mathbb{E}(y_i)\), which here will usually be the mean of the response, \(\mu_i\)

We might model \(y_i\) as following a Poisson distribution if the data were counts, or a binomial distribution in the case of 0/1 data, as we did in the high SCC example

GLM: Linear predictor

We need to decide which predictor variables and any transformations of them should be used to predict \(y\); this is the linear predictor, \(\eta\)

\[\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij}\]

However, as we saw in the Gaussian linear model, there is nothing stopping the above equation returning values that don’t make sense for the variable we are modelling
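The link function is what fixes this: the linear predictor \(\eta\) is unbounded, and the inverse link maps it back onto the valid range for the response. A small sketch for the log link used with counts:

```r
# a linear predictor is unbounded: it can easily go negative
eta <- seq(-3, 3, by = 0.5)

# the inverse of the log link, exp(), maps eta back to a valid
# (strictly positive) mean for count data
mu <- exp(eta)
all(mu > 0)  # TRUE: no negative predicted counts
```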

What are we doing?

Imagine we observed how the counts of something (events, cases of disease, etc.) varied with a covariate \(x\)

What are we doing?

What if we fit a Gaussian linear model to these data?

What are we doing?

The model diagnostics are not great either

What are we doing?

We fit a Poisson GLM to these data

m_pois <- glm(y_obs ~ x, data = df, family = poisson(link = "log"))

What are we doing?

What are we doing?

What are we doing?

Binomial GLM (0/1 data) — logistic regression

What are we doing?

Binomial GLM (counts from a total) — Number of still born piglets in litters

What are we doing?

Gamma GLM — milk yield, weight, etc.

What are we doing?

What would diagnostics for a Gaussian model fitted to those data look like?

What are we doing?

Inverse Gaussian GLM — milk yield, weight, etc.: like the gamma, just more extreme

Poisson GLM

Poisson GLM

Often, we’ll encounter count data of some sort; number of cases of a disease in an epidemic

Counts are strictly non-negative; can’t have a count of less than 0

Variance increases as a function of the mean (lambda; \(\lambda\))

Poisson GLM

The Poisson distribution is a standard distribution for modelling count data

The Poisson gives the distribution of the number of things (individuals, counts, events) in a given sampling interval/effort if each event is independent

The Poisson is a simple distribution; defined by a single parameter \(\lambda\), which is the mean and the variance

The standard link function for a Poisson GLM is the log link; it maps the non-negative mean onto the unbounded scale of the linear predictor

Poisson GLM

\[\begin{align*} y_i & \sim \text{Poisson}(\mu_i) \\ \lambda_i & = \mathbb{E}(y_i) = \mu_i \\ \log(\mu_i) & = \beta_0 + \beta_1 x_1 + \cdots \end{align*}\]
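We can check our understanding of this model by simulating from it and refitting; a minimal sketch with made-up coefficients (\(\beta_0 = 0.5\), \(\beta_1 = 0.3\)):

```r
# simulate from a Poisson GLM with a log link
# beta_0 = 0.5, beta_1 = 0.3 are made-up values for illustration
set.seed(1)
n <- 1000
x <- runif(n, 0, 3)
lambda <- exp(0.5 + 0.3 * x)  # log(mu_i) = beta_0 + beta_1 x_i
y <- rpois(n, lambda)         # y_i ~ Poisson(mu_i)

# fitting the Poisson GLM recovers estimates close to the true values
m_pois_sim <- glm(y ~ x, family = poisson(link = "log"))
coef(m_pois_sim)
```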

Poisson GLM: Example

Association between endometritis and breed of cattle in Danish dairy herds

endo <- tibble::tribble(
  ~ breed,    ~ endometritis, ~ count,
  "Holstein", "Yes", 100,
  "Holstein", "No", 400,
  "Jersey", "Yes", 10,
  "Jersey", "No", 190
) |>
mutate(
  breed = factor(breed),
  endometritis = factor(endometritis, levels = c("No", "Yes"))
)
endo
# A tibble: 4 × 3
  breed    endometritis count
  <fct>    <fct>        <dbl>
1 Holstein Yes            100
2 Holstein No             400
3 Jersey   Yes             10
4 Jersey   No             190

Poisson GLM: Example

Data like this are common in biology — contingency tables

             Yes    No   Total
Holstein     100   400     500
Jersey        10   190     200
Total        110   590     700

A key question of interest is independence of the effects of variables on the cell frequencies

Poisson GLM: Example

Analysis of deviance table similar to sequential SS from linear regression

endo1 <- glm(count ~ breed + endometritis,
  data = endo, family = poisson(link = "log"))
anova(endo1, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: count

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                             3     523.43              
breed         1   132.83         2     390.60 < 2.2e-16 ***
endometritis  1   361.54         1      29.05 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Poisson GLM: interaction

Add an interaction between breed and endometritis to the model & refit

endo2 <- update(endo1, . ~ . + breed:endometritis)
summary(endo2)

Call:
glm(formula = count ~ breed + endometritis + breed:endometritis, 
    family = poisson(link = "log"), data = endo)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  5.99146    0.05000 119.829  < 2e-16 ***
breedJersey                 -0.74444    0.08811  -8.449  < 2e-16 ***
endometritisYes             -1.38629    0.11180 -12.399  < 2e-16 ***
breedJersey:endometritisYes -1.55814    0.34317  -4.540 5.61e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  5.2343e+02  on 3  degrees of freedom
Residual deviance: -3.9968e-14  on 0  degrees of freedom
AIC: 33.517

Number of Fisher Scoring iterations: 3

Poisson GLM: interaction

Compare models using likelihood ratio test

anova(endo1, endo2, test = "LRT")
Analysis of Deviance Table

Model 1: count ~ breed + endometritis
Model 2: count ~ breed + endometritis + breed:endometritis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
1         1     29.054                         
2         0      0.000  1   29.054 7.04e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare models using AIC

AIC(endo1, endo2)
      df      AIC
endo1  3 60.57105
endo2  4 33.51737

Butterfly colour

Mayekar & Kodandaramaiah (2017; PLoS One, 12, e0171482) studied determinants of pupal colour (green vs. brown) of a tropical butterfly

Response

  • Green vs Brown

Predictors

  • time to pupation
  • pupal weight
  • sex

Discuss in your groups:

  • what types of data each of the variables is?
  • what kind of GLM is appropriate?

Problematic grizzly bears

Moorehouse et al (2016; PLoS One, 11, e0165425) used genetic analyses to determine parentage of bear cubs and cross classified cubs and mothers and whether they caused problems around humans

Problematic grizzly bears

Table shows the number of cubs and mothers classified by whether they are problematic or not

                     Mother
                  Yes    No   Total
Offspring  Yes      5     18      23
           No       3     50      53
Total               8     68      76

Discuss in your groups:

  • what types of data each of the variables is?
  • what kind of GLM is appropriate?

Cat Owner’s Attitudes

Teng et al (2020; PLoS One, 15, e0234190) surveyed cat owners in Australia on factors that might affect the prevalence of overweight and obese cats

Body Condition Scores

  1. very underweight
  2. somewhat underweight
  3. ideal
  4. chubby/overweight
  5. fat/obese

Begging behaviour

  1. Never
  2. Sometimes
  3. Often
  4. Always

Cat Owner’s Attitudes

Begging?     BCS 1 & 2   BCS 3   BCS 4 & 5
Never               21     266          56
Sometimes           55     527         162
Often               14     106          69
Always              10      47          44

Discuss in your groups:

  • what types of data each of the variables is?
  • what kind of GLM is appropriate?